In this exploratory analysis we will explore a dataset from the company Prosper, who is part of the peer-to-peer lending industry. In this analysis we want to answer the following questions:
Univariate analysis
Multivariate Analysis
Let’s load the data set:
What is the structure of the dataset?
## [1] 113937 81
This data set contains 113,937 loans with 81 variables, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, Prosper score and the latest payment information.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
In this section we would like to analyze the characteristics of the loan provided on Prosper’s platform, in terms of loan origination date, loan amount, term, monthly payment, and current loan status.
## [1] 84672 81
There are 84,672 bonds listed since July 2009.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 9094 13750 35000
The minimum loan amount is $1000, the maximum is $35,000 and the average loan amount is $8,337. However the median is $6,500. The difference between the average and median is proably due to a distribution skewed to the right.
When we look at the distribution we observe that there are peaks every $5000 ($5000, $10000, $15,000, $20,000, and $25,000). However it’s interesting to note there is a large number of loans with laon amounts equal to $4,000. One of the reasons why $4,000 loans are popular is because Prosper makes it harder to get loans over than $4,000.
The length of the loan expressed in months.
Prosper provides mostly 36 and 60-month loans.
summary(pr$MonthlyLoanPayment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 157.6 252.2 292.3 389.0 2252.0
Most loans have a mothly payment between $131.6 and $371.6. It’s very surprising that some loans have monthly payments equal to 0. There is probably a data quality issue as it shouldn’t be possible to have monthly payments equal to 0.
## Warning: Removed 11637 rows containing non-finite values (stat_bin).
If we change the binwidth to 1 and limit the x axis to $500 we can observe the most common monthly payment is around $175. This monthly payment probably corresponds to a $4000 loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1359 0.1875 0.1961 0.2574 0.3600
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The Borrower Rate seems quite uniformly distributed.
##
## Cancelled Chargedoff Completed
## 0 5326 19501
## Current Defaulted FinalPaymentInProgress
## 56576 997 205
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 806 265 363
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 313 304 16
Most of the loans are active but some have late payments.
What’s the percentage of people who defaulted?
## [1] 0.01177485
Around 11.7% of loans have defaulted.
What’s the percentage of people who had late payments?
## [1] 0.02441185
##
## Cancelled Chargedoff Completed
## 0 5326 19501
## Current Defaulted FinalPaymentInProgress
## 56576 997 205
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 806 265 363
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 313 304 16
Around 2.5% of loan have late payments.
A custom risk score was created by Prosper in order to assess risk. The score ranges from 1-11, with 11 being the best, or lowest, risk score.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 5.945 8.000 11.000
##
## 1 2 3 4 5 6 7 8 9 10 11
## 989 5766 7639 12594 9807 12271 10581 12008 6843 4718 1456
Most of the loans have a score around 6.
People use Prosper mostly for Debt Consolidation (1), Other (4), Home Improvement (2), or Business (3).
What’s the distribution of the IncomeRange of Propsper’s users? Most of the people using Prosper have an income between $25000 and $75000.
What’s the distribution of the debt to income ratio?
summary(pr$DebtToIncomeRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.150 0.220 0.259 0.320 10.010 7281
## Warning: Removed 7453 rows containing non-finite values (stat_bin).
The debt to income ratio is skewed to the right with a median of 0.22.
What’s the distribtution of the credit score?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 700.0 699.4 720.0 880.0
The distribution of the credit score is centered around 700.
What’s the distribtution of Monthly Income?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3442 5000 5933 7083 1750000
## Warning: Removed 8163 rows containing non-finite values (stat_bin).
On average people who use Prosper have a monthly income of $5,935.
There are lots of features in the dataset that describe characteristics of the loan, like the monthly payment, loan origination amount, borrower rate, term, etc. It would be interesting to see if the monthly payment amount remains the same for a same loan origination amount and term.
Prosper has a Prosper score that quantifies the loan risk. It would be interesting to see if it has an impact on the montly payment for a same loan origination amount and term.
We can also try to identify if there are characteristics of the borrower that influence the Prosper score.
Finally we can explore how Prosper’s portfolio has evolved.
We have observed that certain loans have a monthly payment equal to 0. There must be a data qualilty issue. For the rest of the analysis, we will exclude these loans.
The number of loans issued by loan orgination date have highly increased between 2009 and 2014 even though there was a drop in 2013. Most of the loans issued had a term of 36 months. Prosper issued 60-month loans for the first time in 2011. They stopped issuing 12-month loans in 2013.
cor.test(pr$CreditScoreRangeLower,pr$ProsperScore)
##
## Pearson's product-moment correlation
##
## data: pr$CreditScoreRangeLower and pr$ProsperScore
## t = 115.18, df = 84188, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3631118 0.3747826
## sample estimates:
## cor
## 0.3689617
There are lots of variables correlated in the data set. The most correlated variables (corr = 0.911) are loan origination amount and monthly loan payment, which is not surprising.
Borrower rate is negatively correlated with Prosper score and credit score.
Debt income ratio is slighlty correlated with credit score.
It’s interesting to highlight that credit score and Prosper score are highly correlated (0.386). However, an important variance remains unexplained, which means that Prosper doesn’t only use the credit score to assess risk.
Prosper score is also correlated with loan origination amount and monthly payment.
We can explore how the number of loans, loan origination amount, and monthly payment have evolved between 2009 and 2014.
To do so, we can create three metrics: number of loans, average monthly payment, and average loan origination amount by loan origination date and Term. The average loan origination amount has increased since 2009 for loans with terms of 12 and 36 monthly, until Prosper stopped offering 12-month loans in 2013.
The 60-month loans have started with a very high loan origination amount in 2011, dropped in 2012, then increased steadily in 2013 and 2014.
There are three clusters of points: one is the 12-month loan, another is the 30-month loan, and the last cluster is the 60-month loan.
We can try to explain the variance.
We observe that the variance is explained by risk, which is represented by the ProsperScore. The bottom of the scatter plot is dominated by loans with a ProsoperScore equal to 11, which represents loans with low risks. The top of the scatter plot is dominated by loans with a ProsoperScore equal to 4, which represents loans with higher risks. Loans with a loan amount higher than $25000 are mostly dominated by a ProsperScore equal or superior to 0.
##
## FALSE TRUE
## 83208 982
Based on the boxplot above, it doesn’t seem that loans that defaulted and loans that didn’t default have a very different Prosper Scores.
We have observed that the Prosper score has an influence on montly payments for a same loan origination amount and same term. This Score should be a way to quantify the probability that the loan is going to default or not. However when we look at the Prosper Score it doesn’t seem be a great indicator for the investor if he should invest or not.
The plots above describe how Prosper’s porfolio has involved between 2009 and 2014. The portfolio has drastically changed between 2009 and 2014. In 2009, Prosper issued only 36-month loans. In 2011, Prosper started to issue 12-month and 60-months loan, but stoped issuing 12-month loans in 2013. It seems that Prosper has really changed their underwriting strategies. Between 2011 and 2013, the average loan origination amount has drastically increased, which could expose the lender to a higher risk of loss. It’s perhaps one of the reasons why Prosper stopped issuing 12-month loans.
Most of loans are 36-month loans, and the average loan origination amount has increased drastically between 2009 and 2014.
The 60-month loans started with a very high loan origination amount in 2011, dropped in 2012, then increased steadily in 2013 and 2014.
If we look at 36-month loans by monthly loan payment and loan origination amount, we notice that there is a linear relation. However, the monthly payment is influenced by the Prosper score for a same loan orgination amount and same term. We can also identify a couple of outliers where, for example, a same loan orignation amount and montly loan payment yields a Proser score around 10 instead of being around 2 or 3. It means that in order to estimate a monthly payment for a same loan origination amount and term, Prosper uses factors other than just those used for creating of the Prosper Score.
The plot above displays a box plot for Prosper Score by the boolean value, defaulted or not. The third quartile for the loans that defaulted is a little bit lower than the loans that didn’t default. However the first quartile and median are equivalent for loans that defaulted and loans that didn’t default. So it seems that the Prosper score is not a good variable to use to determine if an investor should invest or not in a specific loan.
We have been able to analyze that Prosper’s portfolio has really changed between 2009 and 2014.
It seems that Prosper doesn’t only use credit scores to assess risk but also several other variables that could be in the dataset, in addition to external data not included in the dataset available.
Instead of just using the Prosper score to decide whether or not to invest, it would be interesting to build a predictive model that quantifies if the person is going to default on their loan. We could use, for example, a logistic regression. However there are lots of variables that are correlated, such as loan origination amount, credit score, and Prosper score. It’s important to carefully choose which variables should be included in the model.